refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus#11330
refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus#11330seedspirit wants to merge 6 commits into
Conversation
18676b6 to
323c1d1
Compare
There was a problem hiding this comment.
Pull request overview
Migrates the legacy GraphQL live_stat resolver path from Valkey-based session statistics to Prometheus-backed container live-stat queries, while preserving the existing dict[metric_name, MetricValue] wire shape for WebUI/GQL consumers.
Changes:
- Replaced
KernelNode.batch_load_live_statValkey calls with a Prometheus-based batch loader (_batch_load_kernel_live_stat) and a newLegacyLiveStatConverter. - Moved legacy
MetricValue/MovingStatValueintoai.backend.common.metrics.types, adding metric classifications andresolve_unit_hint()for unit derivation. - Added unit tests covering conversion behavior across gauge/rate/diff metrics, pct derivation, defaults, and multi-kernel isolation.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/manager/api/gql_legacy/test_stat_converter.py | Adds unit tests for the legacy live_stat converter behavior. |
| src/ai/backend/manager/repositories/metric/repository.py | Minor refactor of Prometheus query result unpacking for live stats. |
| src/ai/backend/manager/clients/prometheus/fixed_query_builder.py | Removes unreachable guard in template selection for metric types. |
| src/ai/backend/manager/api/gql_legacy/statistics.py | Removes an unused/legacy batch loader wrapper method. |
| src/ai/backend/manager/api/gql_legacy/stat_converter.py | Introduces LegacyLiveStatConverter to adapt Prometheus results to legacy shape. |
| src/ai/backend/manager/api/gql_legacy/kernel.py | Switches live_stat resolver to Prometheus action + legacy conversion via dataloader. |
| src/ai/backend/common/types.py | Removes legacy MetricValue/MovingStatValue from the common types module. |
| src/ai/backend/common/metrics/types.py | Adds MetricValue/MovingStatValue + unit hint resolution and metric classifications. |
| src/ai/backend/common/clients/valkey_client/valkey_stat/client.py | Updates imports to use the moved MetricValue type. |
| src/ai/backend/client/output/formatters.py | Updates imports to use the moved MetricValue type. |
| src/ai/backend/appproxy/worker/types.py | Updates imports to use the moved MetricValue/MovingStatValue types. |
| src/ai/backend/agent/stats.py | Updates imports to use the moved MetricValue/MovingStatValue types. |
| changes/11330.enhance.md | Adds changelog entry for the migration. |
Comments suppressed due to low confidence (1)
src/ai/backend/manager/clients/prometheus/fixed_query_builder.py:154
_get_template()no longer has a fallback branch. If an unexpected/newMetricTypevalue ever reaches this function (e.g., enum extended in the future), Python will implicitly returnNone, which then propagates intoMetricPreset(template=...)and fails later with a less clear error. Please add an explicit default case that raises (e.g.,UnreachableError/ValueError) so failures are immediate and actionable.
def _get_template(self, metric_type: MetricType) -> str:
match metric_type:
case MetricType.GAUGE:
return _GAUGE_TEMPLATE
case MetricType.RATE:
return _RATE_TEMPLATE
case MetricType.DIFF:
return _DIFF_TEMPLATE
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| MetricValue = TypedDict( | ||
| "MetricValue", | ||
| { | ||
| "current": str, | ||
| "capacity": str, | ||
| "pct": str, | ||
| "unit_hint": str, | ||
| "stats.min": str, | ||
| "stats.max": str, | ||
| "stats.sum": str, | ||
| "stats.avg": str, | ||
| "stats.diff": str, | ||
| "stats.rate": str, | ||
| "stats.version": int | None, | ||
| }, | ||
| ) |
There was a problem hiding this comment.
MetricValue now defines capacity as a required str, but there are still in-repo producers/consumers that treat capacity as nullable (e.g., alembic stats migration sets "capacity": None, and the CLI formatter checks metric["capacity"] is not None). This makes the TypedDict inconsistent with real payloads and will either break type-checking or force unsafe casts. Please either keep capacity as str | None (like the previous definition) or update all producers to always emit a string (e.g., "0") and remove/adjust the None handling accordingly.
Reflects review: the converter is stateless, so callers shouldn't instantiate it on every batch load. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Main moved the type from `manager.data.metric.types` to `common.clients.prometheus.metric_types`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cde8b1f to
1a2dbeb
Compare
| Merge order from upstream is gauge -> diff -> rate, so for | ||
| RATE/DIFF metrics the same `(name, CURRENT)` tuple appears twice; | ||
| `currents[0]` is the raw gauge sample, `currents[-1]` is the | ||
| rate/diff query result. |
There was a problem hiding this comment.
In a future PR, I plan to refactor the code so that it can be converted from an index-based to a type-based approach. Since this would require modifying the response merge logic, I did not include it in this PR.
Summary
KernelNode.batch_load_live_stat(Valkey →valkey_stat.get_session_statistics_batch) is replaced by_batch_load_kernel_live_stat, which calls the metric processor and adapts the result through a newLegacyLiveStatConverter. Wire shape (dict[metric_name, MetricValue]) is preserved so GQL/WebUI consumers stay compatible.MetricValue/MovingStatValuemove fromcommon/types.pytocommon/metrics/types.pynext to the newRATE_STAT_METRICS/DIFF_STAT_METRICSclassifications andresolve_unit_hint()helper (with naming-convention fallback so plugin metrics still get a usable unit hint).LegacyLiveStatConverterunit tests covering gauge / rate / diff / capacity-default / pct-derivation / unknown-metric / multi-kernel isolation.Known wire-level gaps
The remaining gaps are addressed by follow-up PRs that wire additional sources into the same converter; this PR's converter will pick them up as those land.
current/pct(with capacity) /unit_hint/stats.diff/stats.ratecapacityforcpu_used/net_*/io_*(and dependentpct)CAPACITYsample as-is, so it picks up the synthesized value automatically once #11535 lands.stats.max(all metrics) /stats.avg(cpu_util,cuda_util, plugin accel metrics)*_mem/*_util/*_power/*_temperature), socuda_*and other accelerator families getstats.max/stats.avgfor free. Converter wiring (mapping newvalue_type=max/value_type=avgsamples into the legacystats.max/stats.avgslots) is the remaining piece on top of #11360.*_clock/*_voltage)_ACCEL_GAUGE_SUFFIXES_*incommon/clients/prometheus/metric_types.py).Test plan
pants test tests/unit/manager/api/gql_legacy/test_stat_converter.pypants checkon the modified filespants fmt/pants lintcleanscripts/test-live-stat-equivalence.shagainst a live session (stash → A label, pop → B label, difflive-stat-eqv/Avslive-stat-eqv/B)🤖 Generated with Claude Code